05. Credit Assignment
Going back to the gradient estimate, we can take a closer look at the total reward R, which is just the sum of the rewards collected at each step: R = r_1 + r_2 + … + r_{t-1} + r_t + …
Let’s think about what happens at time-step t. Even before an action is decided, the agent has already received all the rewards up until step t-1. So we can think of that part of the total reward as the reward from the past. The rest is denoted as the future reward.
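Written out (using R_t^past and R_t^future as shorthand for the two pieces just described), the split of the total reward at time-step t is:

R = (r_1 + r_2 + … + r_{t-1}) + (r_t + r_{t+1} + …) = R_t^past + R_t^future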
Because we have a Markov process, the action at time-step t can only affect the future reward, so the past reward shouldn’t contribute to the policy gradient.
So to properly assign credit to the action a_t, we should ignore the past reward. A better policy gradient would then simply use the future reward R_t^future as the coefficient.
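To make this concrete, here is a minimal sketch in Python with NumPy (rewards_to_go is a hypothetical helper name, not part of any library) of how the future-reward coefficients for one trajectory can be computed with a reversed cumulative sum, and how they would scale the log-probability terms in a REINFORCE-style update.

```python
import numpy as np

def rewards_to_go(rewards):
    """Future reward for each time step t: r_t + r_{t+1} + ... + r_T.
    Computed as a reversed cumulative sum over the trajectory."""
    rewards = np.asarray(rewards, dtype=float)
    return np.flip(np.cumsum(np.flip(rewards)))

# Per-step rewards of one toy trajectory.
rewards = [1.0, 0.0, 2.0, 1.0]
print(rewards_to_go(rewards))   # -> [4. 3. 3. 1.]

# In a REINFORCE-style update, each future reward multiplies the
# log-probability of the action taken at that step, e.g. (PyTorch-like
# pseudocode, assuming log_probs holds log pi_theta(a_t | s_t) for each t):
#   loss = -(torch.as_tensor(rewards_to_go(rewards)) * log_probs).sum()
```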
Notes on Gradient Modification
You might wonder, why is it okay to just change our gradient? Wouldn't that change our original goal of maximizing the expected reward?
It turns out that, mathematically, ignoring past rewards might change the gradient for each specific trajectory, but it doesn't change the averaged gradient: the dropped terms pair the past reward, which is already fixed before the action a_t is chosen, with a log-probability gradient whose average over the policy's actions is zero, so they vanish in expectation. So even though the gradient is different during training, on average we are still maximizing the expected reward. In fact, the resultant gradient is less noisy, so training using the future reward should speed things up!
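As a quick numerical sanity check of this claim, here is a toy sketch under assumed dynamics (a single-parameter Bernoulli policy and two independent steps where the reward equals the action taken; none of these specifics come from the lesson): both ways of weighting the step-2 log-probability gradient have roughly the same sample mean, but the version that also includes the past reward r_1 has a noticeably larger standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

theta = 0.3                 # single policy parameter (toy Bernoulli policy)
p = sigmoid(theta)          # probability of taking action 1 at every step
n = 200_000                 # number of sampled two-step episodes

a1 = rng.binomial(1, p, n)  # action at step 1 (reward r_1 = a_1)
a2 = rng.binomial(1, p, n)  # action at step 2 (reward r_2 = a_2)
r1, r2 = a1.astype(float), a2.astype(float)

# d/dtheta log pi_theta(a) for a Bernoulli(sigmoid(theta)) policy is (a - p).
grad_logp_a2 = a2 - p

full_return   = (r1 + r2) * grad_logp_a2   # total reward as the coefficient
future_reward = r2 * grad_logp_a2          # only the reward that a_2 can affect

print("mean, full return  :", full_return.mean())
print("mean, future reward:", future_reward.mean())
print("std,  full return  :", full_return.std())
print("std,  future reward:", future_reward.std())
```

With enough sampled episodes the two means agree, while the full-return version is noisier, which is exactly the point made above.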